Load packages:
library(tidyverse)
library(ggplot2) # superfluous because ggplot2 is part of tidyverse
library(scales) # for formatting labels for axes and legends
library(haven)
library(labelled)
Resources used to create this lecture:
We will use two datasets that are part of the ggplot2 package:
mpg: EPA fuel economy data in 1999 and 2008 for 38 car models that had a new release every year between 1999 and 2008
diamonds: Prices and attributes of about 54,000 diamonds
#?mpg
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "aud…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattr…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8…
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", …
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, …
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, …
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact…
#?diamonds
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.2…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, …
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 6…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.0…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.0…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.3…
We will use public-use data from the National Center for Education Statistics (NCES) Education Longitudinal Study of 2002 (ELS:2002):
stu_id uniquely identifies observations
# variables we want to select from full ELS dataset
els_keepvars <- c(
"STU_ID", # student id
"STRAT_ID", # stratum id
"PSU", # primary sampling unit
"BYRACE", # (base year) race/ethnicity
"BYINCOME", # (base year) parental income
"BYPARED", # (base year) parental education
"BYNELS2M", # (base year) math score
"BYNELS2R", # (base year) reading score
"F3ATTAINMENT", # (3rd follow up) attainment
"F2PS1SEC", # (2nd follow up) first institution attended
"F3ERN2011", # (3rd follow up) earnings from employment in 2011
"F1SEX", # (1st follow up) sex composite
"F2EVRATT", # (2nd follow up, composite) ever attended college
"F2PS1LVL", # (2nd follow up, composite) first attended postsecondary institution, level
"F2PS1CTR", # (2nd follow up, composite) first attended postsecondary institution, control
"F2PS1SLC" # (2nd follow up, composite) first attended postsecondary institution, selectivity
)
els_keepvars
## [1] "STU_ID" "STRAT_ID" "PSU" "BYRACE"
## [5] "BYINCOME" "BYPARED" "BYNELS2M" "BYNELS2R"
## [9] "F3ATTAINMENT" "F2PS1SEC" "F3ERN2011" "F1SEX"
## [13] "F2EVRATT" "F2PS1LVL" "F2PS1CTR" "F2PS1SLC"
load(url("https://github.com/anyone-can-cook/rclass2/raw/main/data/els/els.RData"))
els <- els %>%
# keep only subset of vars
select(one_of(els_keepvars)) %>%
# lower variable names
rename_all(tolower)
glimpse(els)
## Rows: 16,197
## Columns: 16
## $ stu_id <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 1011…
## $ strat_id <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 10…
## $ psu <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ byrace <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3, …
## $ byincome <dbl+lbl> 10, 11, 10, 2, 6, 9, 10, 10, 8, 3, 8, 8, 5, 8, 1…
## $ bypared <dbl+lbl> 5, 5, 2, 2, 1, 2, 6, 2, 2, 1, 6, 4, 4, 2, 7, 2, …
## $ bynels2m <dbl+lbl> 47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16,…
## $ bynels2r <dbl+lbl> 39.04, 36.35, 42.68, 27.86, 13.07, 11.70, 19.66,…
## $ f3attainment <dbl+lbl> 3, 10, 6, 4, 4, 3, 4, 6, -4, 3, 3, 3, 5, 5, 6, -…
## $ f2ps1sec <dbl+lbl> -8, 1, 1, 4, 4, -3, 4, 2, -4, 4, 1, -4, -4, 4, 2…
## $ f3ern2011 <dbl+lbl> 4000, 3000, 37000, 1500, 48000, 35000, 17000, 68…
## $ f1sex <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, …
## $ f2evratt <dbl+lbl> -8, 1, 1, 1, 1, 0, 1, 1, -4, 1, 1, -4, -4, 1, 1,…
## $ f2ps1lvl <dbl+lbl> -8, 1, 1, 2, 2, -3, 2, 1, -4, 2, 1, -4, -4, 2, 1…
## $ f2ps1ctr <dbl+lbl> -8, 1, 1, 1, 1, -3, 1, 2, -4, 1, 1, -4, -4, 1, 2…
## $ f2ps1slc <dbl+lbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, …
els %>% var_label()
## $stu_id
## [1] "Student ID"
##
## $strat_id
## [1] "Stratum"
##
## $psu
## [1] "Primary sampling unit"
##
## $byrace
## [1] "Student's race/ethnicity-composite"
##
## $byincome
## [1] "Total family income from all sources 2001-composite"
##
## $bypared
## [1] "Parents' highest level of education"
##
## $bynels2m
## [1] "ELS-NELS 1992 scale equated sophomore math score"
##
## $bynels2r
## [1] "ELS-NELS 1992 scale equated sophomore reading score"
##
## $f3attainment
## [1] "Highest level of education earned as of F3"
##
## $f2ps1sec
## [1] "Sector of first postsecondary institution"
##
## $f3ern2011
## [1] "2011 employment income: R only"
##
## $f1sex
## [1] "F1 sex-composite"
##
## $f2evratt
## [1] "Whether has ever attended a postsecondary institution - composite"
##
## $f2ps1lvl
## [1] "Level of offering of first postsecondary institution"
##
## $f2ps1ctr
## [1] "Control of first postsecondary institution"
##
## $f2ps1slc
## [1] "Institutional selectivity of first attended postsecondary institution"
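The `<dbl+lbl>` columns shown above are labelled vectors: numeric values that carry a variable label and a set of value labels. The sketch below builds a small, made-up labelled vector (the values and labels are illustrative and are not taken from ELS) to show how var_label(), val_labels(), and as_factor() relate to one another.

```r
library(haven)     # labelled(), as_factor()
library(labelled)  # var_label(), val_labels()

# Hypothetical toy vector, not ELS data
sex <- labelled(
  c(1, 2, -4),
  labels = c(Male = 1, Female = 2, Nonrespondent = -4),
  label = "Sex (toy example)"
)

var_label(sex)    # the variable label
val_labels(sex)   # named vector linking each label to its underlying value

# as_factor() replaces the numeric codes with their value labels
sex_fct <- as_factor(sex)
as.character(sex_fct)
```

This is why the lecture pipes results through as_factor() before printing: the underlying codes are easier to read once they are replaced by their labels.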
Basic definitions:
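The seven parameters of the layered grammar of graphics are the data, the aesthetic mappings, the geom function, the stat, the position adjustment, the coordinate function, and the facet function.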
ggplot2 – part of tidyverse – is an R package to create graphics and ggplot() is a function within the ggplot2 package.
“In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.” (Wickham & Grolemund, 2017, Chapter 3)
Syntax conveying the seven parameters of the layered grammar of graphics:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOM_FUNCTION>(
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
What does Wickham mean by layers? (from “Telling Stories with Data Using the Grammar of Graphics” by Liz Sander)
The five layers of the grammar of graphics:
Data defines the information to be visualized.
Example: Imagine a dataset where each observation is a student
Variables include high school math test score (bynels2m), earnings in 2011 (f3ern2011), and student sex (f1sex)
els %>% select(stu_id, bynels2m, f3ern2011, f1sex) %>% as_factor() %>% head(10)
## # A tibble: 10 x 4
## stu_id bynels2m f3ern2011 f1sex
## <dbl> <fct> <fct> <fct>
## 1 101101 47.84 4000 Female
## 2 101102 55.3 3000 Female
## 3 101104 66.24 37000 Female
## 4 101105 35.33 1500 Female
## 5 101106 29.97 48000 Female
## 6 101107 24.28 35000 Male
## 7 101108 45.16 17000 Male
## 8 101109 66.01 68000 Male
## 9 101110 28.28 Nonrespondent Male
## 10 101111 38.85 42000 Male
Mapping defines how variables in a dataset are applied (mapped) to a graphic.
Example: Consider the previous dataset
els %>% select(stu_id, bynels2m, f3ern2011, f1sex) %>%
rename(x = bynels2m, y = f3ern2011, color = f1sex) %>%
as_factor() %>% head(10)
## # A tibble: 10 x 4
## stu_id x y color
## <dbl> <fct> <fct> <fct>
## 1 101101 47.84 4000 Female
## 2 101102 55.3 3000 Female
## 3 101104 66.24 37000 Female
## 4 101105 35.33 1500 Female
## 5 101106 29.97 48000 Female
## 6 101107 24.28 35000 Male
## 7 101108 45.16 17000 Male
## 8 101109 66.01 68000 Male
## 9 101110 28.28 Nonrespondent Male
## 10 101111 38.85 42000 Male
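As a rough sketch of what a mapping does, the block below maps drv from the mpg dataset to the color aesthetic and then uses layer_data() to inspect the values ggplot2 computed for the layer. The choice of mpg and drv is only for illustration; any discrete variable would behave the same way.

```r
library(ggplot2)

p_map <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# layer_data() returns the data ggplot2 will draw for a layer,
# after variables have been mapped to aesthetics
mapped <- layer_data(p_map)
head(mapped[, c("x", "y", "colour")])

# each distinct value of drv is assigned its own colour
n_colours <- length(unique(mapped$colour))
n_drv <- length(unique(mpg$drv))
```

The renamed columns in the table above (x, y, color) are a hand-made version of the same idea: the mapping decides which variable plays which visual role.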
A statistical transformation transforms the underlying data before plotting it.
Example: Imagine creating a scatterplot of the relationship between HS math test score (x-axis) and 2011 income (y-axis)
els %>% select(stu_id,bynels2m,f3ern2011) %>% rename(x=bynels2m, y=f3ern2011) %>%
as_factor() %>% head(10)
## # A tibble: 10 x 3
## stu_id x y
## <dbl> <fct> <fct>
## 1 101101 47.84 4000
## 2 101102 55.3 3000
## 3 101104 66.24 37000
## 4 101105 35.33 1500
## 5 101106 29.97 48000
## 6 101107 24.28 35000
## 7 101108 45.16 17000
## 8 101109 66.01 68000
## 9 101110 28.28 Nonrespondent
## 10 101111 38.85 42000
Example: Imagine creating a bar chart of the number of students by race/ethnicity
els %>% count(byrace) %>% as_factor()
## # A tibble: 9 x 2
## byrace n
## <fct> <int>
## 1 Survey component legitimate skip/NA 305
## 2 Nonrespondent 648
## 3 Amer. Indian/Alaska Native, non-Hispanic 130
## 4 Asian, Hawaii/Pac. Islander,non-Hispanic 1460
## 5 Black or African American, non-Hispanic 2020
## 6 Hispanic, no race specified 996
## 7 Hispanic, race specified 1221
## 8 More than one race, non-Hispanic 735
## 9 White, non-Hispanic 8682
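The counting step in the bar chart example can be checked directly. The sketch below uses the mpg dataset (chosen only because it ships with ggplot2) and compares the counts that geom_bar() computes through its default stat with counts computed by hand.

```r
library(ggplot2)

p_bar <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()   # default stat = "count"

# counts computed by the stat, one row per bar
bar_counts <- layer_data(p_bar)$count

# the same counts computed by hand
manual_counts <- as.vector(table(mpg$class))

bar_counts
manual_counts
```

By contrast, a scatterplot uses stat = "identity", so the values drawn are the values in the data.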
Graphs visually display data, using geometric objects like a point, line, bar, etc.
Position adjustment adjusts the position of visual elements in the plot so that these visual elements do not overlap with one another in ways that make the plot difficult to interpret.
Example: The dataset mpg (included in the ggplot2 package) contains variables for the specifications of different cars, with 234 observations
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
geom_point()
The jitter position adjustment “adds a small amount of random variation to the location of each point” (from ?geom_jitter)
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
geom_point(position = "jitter")
“A coordinate system maps the position of objects onto the plane of the plot, and controls how the axes and grid lines are drawn. Plots typically use two coordinates (x,y), but could use any number of coordinates.” (Grammar of Graphics)
Example: Cartesian coordinate system
x1 <- c(1, 10)
y1 <- c(1, 5)
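# qplot() is deprecated in recent versions of ggplot2, so build the blank plot with ggplot()
p <- ggplot(data = data.frame(x1, y1), mapping = aes(x = x1, y = y1)) +
geom_blank() +
xlab(NULL) + ylab(NULL) +
theme_bw()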
p +
ggtitle(label = "Cartesian coordinate system")
Use coord_fixed() to fix the scaling of the coordinate system
p +
coord_fixed()
Use coord_flip() to swap the x and y axes. (From R for Data Science)
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
Example: Polar coordinate system
p +
coord_polar() +
ggtitle(label = "Polar coordinate system")
Facets are subplots that each display one subset of the data. They are most commonly used to create “small multiples”.
Example: Imagine creating a scatterplot of the relationship between number of cylinders in the engine (x-axis) and highway miles-per-gallon (y-axis), with separate subplots for car class (e.g., midsize, minivan, pickup, suv)
ggplot(data = mpg) +
geom_point(mapping = aes(x = cyl, y = hwy), position = "jitter") +
facet_wrap(~ class, nrow = 2)
ggplot
ggplot() and aes() functions
Show help pages for package ggplot2:
help(package = ggplot2)
The ggplot() function:
?ggplot
# SYNTAX AND DEFAULT VALUES
ggplot(data = NULL, mapping = aes())
“ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden”
data: Dataset to use for plot. If not specified in ggplot() function, must be supplied in each layer added to the plot.
mapping: Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot.
The aes() function (often called within the ggplot() function):
?aes
# SYNTAX
aes(x, y, ...)
“Aesthetic mappings can be set in ggplot() and in individual layers.”
x, y, ...: List of name value pairs giving aesthetics to map to variables
The names for the x and y aesthetics are typically omitted because they are so common
Example: Putting ggplot() and aes() together
Calling ggplot() and aes() without specifying a geom layer (e.g., geom_point()) creates a blank ggplot:
ggplot(data = diamonds, aes(x = carat, y = price))
ggplot(data = diamonds, mapping = aes(x = carat, y = price))
A data frame can also be piped into the data argument of ggplot():
class(diamonds)
## [1] "tbl_df" "tbl" "data.frame"
diamonds %>% ggplot(mapping = aes(x = carat, y = price))
diam_ggplot <- ggplot(data = diamonds, aes(x = carat, y = price))
diam_ggplot # blank ggplot
typeof(diam_ggplot)
## [1] "list"
class(diam_ggplot)
## [1] "gg" "ggplot"
str(diam_ggplot)
## List of 9
## $ data : tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## ..$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## ..$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## ..$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## ..$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## ..$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## ..$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## ..$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## ..$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## ..$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## ..$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ layers : list()
## $ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
## add: function
## clone: function
## find: function
## get_scales: function
## has_scale: function
## input: function
## n: function
## non_position_scales: function
## scales: NULL
## super: <ggproto object: Class ScalesList, gg>
## $ mapping :List of 2
## ..$ x: language ~carat
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## ..$ y: language ~price
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## ..- attr(*, "class")= chr "uneval"
## $ theme : list()
## $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: TRUE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordCartesian, Coord, gg>
## $ facet :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## $ plot_env :<environment: R_GlobalEnv>
## $ labels :List of 2
## ..$ x: chr "carat"
## ..$ y: chr "price"
## - attr(*, "class")= chr [1:2] "gg" "ggplot"
attributes(diam_ggplot)
## $names
## [1] "data" "layers" "scales" "mapping" "theme"
## [6] "coordinates" "facet" "plot_env" "labels"
##
## $class
## [1] "gg" "ggplot"
diam_ggplot$mapping
## Aesthetic mapping:
## * `x` -> `carat`
## * `y` -> `price`
diam_ggplot$labels
## $x
## [1] "carat"
##
## $y
## [1] "price"
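The object inspected above has data and mappings but no layers yet. As a minimal sketch (using mpg rather than diamonds only to keep it small), the block below shows that adding a geom with `+` returns a new plot object, and that the point layer draws one row per observation.

```r
library(ggplot2)

p_blank <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p_points <- p_blank + geom_point()   # `+` returns a new plot; p_blank is unchanged

inherits(p_blank, "ggplot")
inherits(p_points, "ggplot")

# the point layer draws one row per observation in the data
n_drawn <- nrow(layer_data(p_points))
n_drawn
```

Because p_blank is left untouched, the same base object can be reused with different geoms.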
Adding a geometric layer to a ggplot object dictates how observations are displayed in the plot.
geom_point(): creates a scatterplot
geom_bar(): creates a bar chart
geom_point()
Scatterplots are most useful for showing the relationship between two continuous variables.
Example: Scatterplot of the relationship between carat and price, using the diamonds dataset
#ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_point()
diam_ggplot + geom_point()
Example: Scatterplot of the relationship between high school math test score (bynels2m) and 2011 earnings (f3ern2011), using the els dataset
els %>% select(bynels2m,f3ern2011) %>%
summarize_all(.funs = list(~ mean(., na.rm = TRUE), ~ min(., na.rm = TRUE), ~ max(., na.rm = TRUE)))
## # A tibble: 1 x 6
## bynels2m_mean f3ern2011_mean bynels2m_min f3ern2011_min bynels2m_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44.3 21276. -8 -8 79.3
## # … with 1 more variable: f3ern2011_max <dbl>
els %>% select(bynels2m) %>% filter(bynels2m<0) %>% count(bynels2m)
## # A tibble: 1 x 2
## bynels2m n
## <dbl+lbl> <int>
## 1 -8 [Survey component legitimate skip/NA] 305
els %>% select(bynels2m) %>% filter(bynels2m<0) %>% count(bynels2m) %>% as_factor()
## # A tibble: 1 x 2
## bynels2m n
## <fct> <int>
## 1 Survey component legitimate skip/NA 305
els %>% select(f3ern2011) %>% filter(f3ern2011<0) %>% count(f3ern2011)
## # A tibble: 2 x 2
## f3ern2011 n
## <dbl+lbl> <int>
## 1 -8 [Survey component legitimate skip/NA] 459
## 2 -4 [Nonrespondent] 2488
els %>% select(f3ern2011) %>% filter(f3ern2011<0) %>% count(f3ern2011) %>% as_factor()
## # A tibble: 2 x 2
## f3ern2011 n
## <fct> <int>
## 1 Survey component legitimate skip/NA 459
## 2 Nonrespondent 2488
Convert these negative missing-value codes to NA:
els_v2 <- els %>%
mutate(
hs_math = if_else(bynels2m<0,NA_real_,as.numeric(bynels2m)),
earn2011 = if_else(f3ern2011<0,NA_real_,as.numeric(f3ern2011)),
)
#check
els_v2 %>% filter(bynels2m<0) %>% count(bynels2m, hs_math)
## # A tibble: 1 x 3
## bynels2m hs_math n
## <dbl+lbl> <dbl> <int>
## 1 -8 [Survey component legitimate skip/NA] NA 305
els_v2 %>% filter(f3ern2011<0) %>% count(f3ern2011, earn2011)
## # A tibble: 2 x 3
## f3ern2011 earn2011 n
## <dbl+lbl> <dbl> <int>
## 1 -8 [Survey component legitimate skip/NA] NA 459
## 2 -4 [Nonrespondent] NA 2488
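To see why this recoding matters, here is a small sketch on made-up numbers (the scores below are hypothetical, with -8 and -4 standing in for missing-value codes). Leaving the codes in place pulls summary statistics toward them, which is what the minimum of -8 in the earlier summary hinted at.

```r
library(dplyr)

# Hypothetical scores; -8 and -4 stand in for missing-value codes
score <- c(47.84, -8, 66.24, -4, 35.33)

mean_raw <- mean(score)   # the missing codes drag the mean down

# NA_real_ keeps both branches of if_else() the same (double) type
score_clean <- if_else(score < 0, NA_real_, score)
mean_clean <- mean(score_clean, na.rm = TRUE)

c(raw = mean_raw, clean = mean_clean)
```

The same pattern is applied to bynels2m and f3ern2011 above, with as.numeric() stripping the value labels first.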
els_v2 %>% count(bypared) %>% as_factor()
## # A tibble: 11 x 2
## bypared n
## <fct> <int>
## 1 Missing 49
## 2 Survey component legitimate skip/NA 179
## 3 Nonrespondent 648
## 4 Did not finish high school 944
## 5 Graduated from high school or GED 3053
## 6 Attended 2-year school, no degree 1666
## 7 Graduated from 2-year school 1597
## 8 Attended college, no 4-year degree 1758
## 9 Graduated from college 3468
## 10 Completed Master's degree or equivalent 1786
## 11 Completed PhD, MD, other advanced degree 1049
# keep students whose parents completed a PhD, MD, or other advanced degree (bypared == 8)
els_parphd <- els_v2 %>% filter(bypared==8)
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011)) + geom_point()
The geom_point() function:
?geom_point
# SYNTAX AND DEFAULT VALUES
geom_point(mapping = NULL, data = NULL, stat = "identity",
position = "identity", ..., na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
geom_point() understands (i.e., accepts) the following aesthetics (x and y are required)
x, y, alpha, colour, fill, group, shape, size, stroke
Other geoms (e.g., geom_bar()) accept a different set of aesthetics
Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), using the mpg dataset
Color points by type of car (class):
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
Alternatively, the color aesthetic can be specified within geom_point():
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class))
Student Task: Using the els_parphd dataset, create a scatterplot of the relationship between HS math score (hs_math) on the x-axis and 2011 earnings (earn2011) on the y-axis, with the color of points determined by sex (f1sex)
For a discrete color scale, the color aesthetic should be a factor variable. f1sex is a labelled numeric variable, so convert it with as_factor():
# f1sex as a labelled numeric variable: color is treated as continuous
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011, color = f1sex)) + geom_point()
# f1sex converted to a factor: one color per category
ggplot(data= els_parphd, aes(x = hs_math, y = earn2011, color = as_factor(f1sex))) + geom_point()
geom_smooth()
Why use geom_smooth()?
ggplot(data = els_v2, aes(x = hs_math, y = earn2011)) + geom_point()
geom_smooth() creates smoothed prediction lines with shaded confidence intervals:
ggplot(data = els_v2, aes(x = hs_math, y = earn2011)) + geom_smooth()
The geom_smooth() function:
?geom_smooth
# SYNTAX AND DEFAULT VALUES
geom_smooth(mapping = NULL, data = NULL, stat = "smooth",
position = "identity", ..., method = "auto", formula = y ~ x,
se = TRUE, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
Note the default statistical transformation (stat), as compared to that of geom_point():
stat = "smooth" for geom_smooth()
stat = "identity" for geom_point()
geom_smooth() accepts the following aesthetics (x and y are required)
x, y, alpha, colour, fill, group, linetype, size, weight, ymax, ymin
Example: Smoothed prediction lines for high school math test score (hs_math) versus 2011 earnings (earn2011), using the els_v2 dataset
Aesthetics can be specified within geom_smooth() rather than ggplot():
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011))
Use the group aesthetic to create separate prediction lines by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, group=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, group=as_factor(f1sex)))
Use the linetype aesthetic to create separate prediction lines (with different line styles) by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, linetype=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, linetype=as_factor(f1sex)))
Use the color aesthetic to create separate prediction lines (with different colors) by sex (f1sex):
#ggplot(data=els_v2, aes(x = hs_math, y = earn2011, color=as_factor(f1sex))) + geom_smooth()
ggplot(data=els_v2) + geom_smooth(mapping = aes(x = hs_math, y = earn2011, color=as_factor(f1sex)))
Example: Layer smoothed prediction lines (geom_smooth()) on top of scatterplot (geom_point())
ggplot(data= els_v2) +
geom_point(mapping = aes(x = hs_math, y = earn2011)) +
geom_smooth(mapping = aes(x = hs_math, y = earn2011))
ggplot(data= els_v2, aes(x = hs_math, y = earn2011)) +
geom_point() +
geom_smooth()
Adjust the axis limits using + xlim() and + ylim():
ggplot(data= els_v2, aes(x = hs_math, y = earn2011)) +
geom_point() +
geom_smooth() +
xlim(c(20,80)) + ylim(c(0,100000))
Layer smoothed prediction lines by sex (f1sex) on top of scatterplot with different point colors by sex:
ggplot(data= els_v2) +
geom_point(mapping = aes(x = hs_math, y = earn2011, color = as_factor(f1sex))) +
geom_smooth(mapping = aes(x = hs_math, y = earn2011, linetype = as_factor(f1sex))) +
xlim(c(20,80)) + ylim(c(0,100000))
geom_bar() and geom_col()
Bar charts are used to plot a single, discrete variable.
Two geom functions to create bar charts:
geom_bar(): The height of each bar represents the number of cases (i.e., observations) in the group
Use geom_bar() when working with (for example) student-level data that you don’t want to summarize before creating the chart
geom_col(): The height of each bar represents the value of some variable for the group
Use geom_col() when you have already created an object of summary statistics (e.g., counts, mean value, etc.)
The geom_bar() and geom_col() functions:
?geom_bar
# SYNTAX AND DEFAULT VALUES
geom_bar(mapping = NULL, data = NULL, stat = "count",
position = "stack", ..., width = NULL, binwidth = NULL,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
?geom_col
# SYNTAX AND DEFAULT VALUES
geom_col(mapping = NULL, data = NULL, position = "stack", ...,
width = NULL, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
Example: Bar chart with the variable cut (e.g., “Fair,” “Good,” “Ideal”) as x-axis and number of diamonds as y-axis, using the diamonds dataset
diamonds %>% count(cut)
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
Method 1: Create bar chart using geom_bar()
ggplot(data = diamonds, aes(x = cut)) +
geom_bar()
Method 2: Create bar chart using geom_col()
First, create an object with the number of diamonds for each value of cut:
cut_count <- diamonds %>% count(cut)
cut_count
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
Then use ggplot() + geom_col() to plot the data from the object cut_count:
ggplot(data = cut_count, aes(x = cut, y=n)) +
geom_col()
Alternatively, pipe the counts into ggplot() without creating the cut_count object first:
#diamonds %>% count(cut) %>% str()
diamonds %>% count(cut) %>% str()
## tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## $ cut: Ord.factor w/ 5 levels "Fair"<"Good"<..: 1 2 3 4 5
## $ n : int [1:5] 1610 4906 12082 13791 21551
diamonds %>% count(cut) %>% ggplot(aes(x= cut, y=n)) +
geom_col()
Student Task: Using the els_v2 dataset, create a bar chart with the variable “ever attended postsecondary education” (f2evratt) as x-axis and number of students as y-axis
els_v2 %>% count(f2evratt) %>% as_factor()
## # A tibble: 5 x 2
## f2evratt n
## <fct> <int>
## 1 Survey component legitimate skip/NA 359
## 2 Nonrespondent 1691
## 3 Item legitimate skip/NA 108
## 4 No 3505
## 5 Yes 10534
Method 1: Create bar chart using geom_bar()
ggplot(data = els_v2, aes(x = as_factor(f2evratt))) +
geom_bar()
Remove missing values of f2evratt before plotting:
els_v2 %>% filter(f2evratt>=0) %>% ggplot(aes(x = as_factor(f2evratt))) +
geom_bar()
Method 2: Create bar chart using geom_col()
els_v2 %>%
# filter to remove missing values
filter(f2evratt>=0) %>%
# use count() to create summary statistics object
count(f2evratt) %>%
# plot summary statistic object
ggplot(aes(x=as_factor(f2evratt), y=n)) + geom_col()
Facets divide a plot into subplots based on the values of one or more discrete variables. They are most commonly used to create “small multiples”.
Two functions to split your plots into facets:
facet_grid(): Display subplots in grid format, where rows and columns are determined by the faceting variable(s)
facet_grid() is most useful when you have two discrete variables, and all combinations of the variables exist in the data
facet_wrap(): Display all subplots side-by-side, but can be wrapped to fill multiple rows
facet_wrap() generally makes better use of screen space, and you can specify the number of plots in each row or column
The facet_grid() and facet_wrap() functions:
?facet_grid
# SYNTAX AND DEFAULT VALUES
facet_grid(rows = NULL, cols = NULL, scales = "fixed",
space = "fixed", shrink = TRUE, labeller = "label_value",
as.table = TRUE, switch = NULL, drop = TRUE, margins = FALSE,
facets = NULL)
?facet_wrap
# SYNTAX AND DEFAULT VALUES
facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
shrink = TRUE, labeller = "label_value", as.table = TRUE,
switch = NULL, drop = TRUE, dir = "h", strip.position = "top")
Specifying which variable(s) to facet your plot on:
facet_grid()
Because facet_grid() arranges subplots in a grid format, we need to specify how we define the rows and columns
One way is to use the rows and cols arguments, which should be variables quoted by vars()
facet_grid(rows = vars(<var_1>), cols = vars(<var_2>)): facet into both rows and columns
facet_grid(rows = vars(<var_1>)): facet into rows only
facet_grid(cols = vars(<var_1>)): facet into columns only
Another way is to use a formula of the form <row_var> ~ <col_var>
facet_grid(<var_1> ~ <var_2>): facet into both rows and columns
facet_grid(<var_1> ~ .): facet into rows only
facet_grid(. ~ <var_1>): facet into columns only
facet_wrap()
facet_wrap() also accepts a formula for its facets argument
facet_wrap(~ <var_1>): facet by one variable
facet_wrap(<var_1> ~ <var_2>): facet on the combination of two variables
Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), faceted by number of cylinders (cyl), from the mpg dataset
Method 1: Faceting using facet_grid()
# Facet into rows
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(rows = vars(cyl))
# Facet into columns
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(cols = vars(cyl))
# Facet into rows (formula syntax)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(cyl ~ .)
# Facet into columns (formula syntax)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(. ~ cyl)
Method 2: Faceting using facet_wrap()
Unlike facet_grid(), facet_wrap() is not restricted to either rows or columns:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ cyl)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ cyl, nrow = 1)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ cyl, ncol = 1)
Example: Scatterplot of the relationship between engine displacement (displ) and highway miles-per-gallon (hwy), faceted by number of cylinders (cyl) and type of car (class), from the mpg dataset
Method 1: Faceting using facet_grid()
Facet the rows based on cyl and the columns based on class:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(rows = vars(cyl), cols = vars(class))
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(cyl ~ class)
Method 2: Faceting using facet_wrap()
Because facet_wrap() is not defined by rows and columns, it omits any subplots that do not display any data:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(cyl ~ class)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(cyl ~ class, nrow = 3)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(cyl ~ class, ncol = 4)
There are many ways to customize the display of our plot. For this section, we will build upon this scatterplot we saw earlier:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point()
Functions to add title and axis labels:
ggtitle(): Add title of graph
xlab(): Add x-axis label
ylab(): Add y-axis label
Example: Adding title and axis labels
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price')
The scale_x_continuous() and scale_y_continuous() functions:
?scale_x_continuous
?scale_y_continuous
# SYNTAX AND DEFAULT VALUES
scale_x_continuous(
name = waiver(),
breaks = waiver(),
minor_breaks = waiver(),
n.breaks = NULL,
labels = waiver(),
limits = NULL,
expand = waiver(),
oob = censor,
na.value = NA_real_,
trans = "identity",
guide = waiver(),
position = "bottom",
sec.axis = waiver()
)
scale_y_continuous(
name = waiver(),
breaks = waiver(),
minor_breaks = waiver(),
n.breaks = NULL,
labels = waiver(),
limits = NULL,
expand = waiver(),
oob = censor,
na.value = NA_real_,
trans = "identity",
guide = waiver(),
position = "left",
sec.axis = waiver()
)
“scale_x_continuous() and scale_y_continuous() are the default scales for continuous x and y aesthetics.”
name: The name of the scale. Used as the axis or legend title.
labels: Custom labelling of the scales (i.e., ticks)
limits: Limits of the scale (i.e., min/max values)
position: The position of the axis. ('left' or 'right' for y axes, 'top' or 'bottom' for x axes)
The label_number() function:
?label_number
# SYNTAX AND DEFAULT VALUES
label_number(
accuracy = NULL,
scale = 1,
prefix = "",
suffix = "",
big.mark = " ",
decimal.mark = ".",
trim = TRUE,
...
)
“label_number() forces decimal display of numbers (i.e. don’t use scientific notation)”
accuracy: A number to round to (e.g. use 0.01 to show 2 decimal places of precision)
scale: A scaling factor (e.g., x will be multiplied by scale before formatting)
prefix: Symbols to display before value
suffix: Symbols to display after value
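label_number() does not format numbers itself; it returns a function that does. The sketch below applies such a function directly to a few illustrative prices (the three values are arbitrary) so the effect of prefix, suffix, scale, and accuracy can be seen outside of a plot.

```r
library(scales)

# label_number() returns a function that formats numeric vectors
dollars_k <- label_number(prefix = "$", suffix = "K", scale = 1e-3, accuracy = 1)

prices <- c(326, 5000, 18823)   # illustrative values
formatted <- dollars_k(prices)
formatted
```

Passing the function (rather than its output) to the labels argument of a scale lets ggplot2 apply it to whatever axis breaks it chooses.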
Example: Formatting numbers on the y-axis
We can use scale_y_continuous(), in conjunction with label_number() from the scales package, to format the numbers on the y-axis:
Use prefix to add $ before the number
Use suffix to add K after the number
Use a scale of 1e-3 to divide number by 1000
Use an accuracy of 1 to round number to the ones digit
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1))
There are several ways to customize the color palettes of the plot, including scale_color_brewer() for discrete scales and scale_color_gradient() for continuous scales.
The scale_color_brewer() function:
?scale_color_brewer
# SYNTAX AND DEFAULT VALUES
scale_color_brewer(
...,
type = "seq",
palette = 1,
direction = 1,
aesthetics = "colour"
)
The brewer scales provide sequential, diverging, and qualitative colour schemes from ColorBrewer.
palette: Name of the color palette (e.g., 'Spectral', used in the example below)
direction: 1 for default ordering, -1 for reverse ordering
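To see which palette names are available and what direction does, one option is to inspect the palettes directly. This sketch assumes the RColorBrewer package is installed (it is normally pulled in as a dependency of scales) and uses brewer_pal() from scales to retrieve the hex codes that a brewer scale would draw from.

```r
library(scales)

# Palette names are the row names; category is 'seq', 'div', or 'qual'
head(RColorBrewer::brewer.pal.info)

# Hex codes for 7 colors from 'Spectral', in default and reversed order
spectral <- brewer_pal(palette = 'Spectral')(7)
spectral_rev <- brewer_pal(palette = 'Spectral', direction = -1)(7)
spectral
spectral_rev
```

Setting direction = -1 returns the same colors in reverse order, which in a plot flips which category receives which end of the palette.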
Example: Customizing color palette of discrete scale
Let’s color the points by the diamond color. This is the default display:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price, color = color)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1))
We can use scale_color_brewer() to customize the color palette. It also accepts other arguments for labeling, including name to specify the legend title:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price, color = color)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
scale_color_brewer(palette = 'Spectral', name = 'Color')
The scale_color_gradient() function:
?scale_color_gradient
# SYNTAX AND DEFAULT VALUES
scale_color_gradient(
...,
low = "#132B43",
high = "#56B1F7",
space = "Lab",
na.value = "grey50",
guide = "colourbar",
aesthetics = "colour"
)
scale_*_gradient creates a two colour gradient (low-high).
low: Color for low end of the gradient
high: Color for high end of the gradient
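As a rough illustration of what low and high control, the sketch below uses seq_gradient_pal() from scales to interpolate between two colors. The positions 0, 0.5, and 1 stand for the low end, midpoint, and high end of a rescaled variable; the color choices are arbitrary.

```r
library(scales)

# Build a two-color gradient palette from white (low) to purple (high)
gradient <- seq_gradient_pal(low = 'white', high = 'purple')

# Map positions 0 (low), 0.5 (midpoint), and 1 (high) to hex colors
gradient(c(0, 0.5, 1))
```

In a plot, the mapped variable is rescaled to the 0 to 1 range before the gradient is applied, so the smallest value receives the low color and the largest receives the high color.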
Example: Customizing color palette of continuous scale
Let’s color the points by the diamond depth percentage. This is the default display:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price, color = depth)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1))
We can use scale_color_gradient() to customize the color palette. It also accepts other arguments for labeling, including name to specify the legend title and labels to customize the legend values:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price, color = depth)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
scale_color_gradient(low = 'white', high = 'purple', name = 'Depth percentage',
labels = label_number(suffix = '%'))
To customize the display of the plot, ggplot2 offers several preset themes, including:
theme_grey() (default)
theme_bw()
theme_light()
theme_dark()
theme_minimal()
theme_classic()
Example: Using preset theme
diamonds %>%
ggplot(mapping = aes(x = carat, y = price, color = color)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
scale_color_brewer(palette = 'Spectral', name = 'Color') +
theme_minimal()
We can also use theme() to customize specific components of the plot.
Example: Using custom theme
diamonds %>%
ggplot(mapping = aes(x = carat, y = price, color = color)) +
geom_point() +
ggtitle('Correlation between diamond carat and price') +
xlab('Carat') + ylab('Price') +
scale_y_continuous(labels = label_number(prefix = '$', suffix = 'K', scale = 1e-3, accuracy = 1)) +
scale_color_brewer(palette = 'Spectral', name = 'Color') +
theme(
text = element_text(size = 8),
panel.background = element_blank(),
plot.title = element_text(color = '#444444', size = 10, hjust = 0.5, face = 'bold'),
axis.ticks = element_blank(),
axis.title = element_text(face = 'bold'),
legend.title = element_text(face = 'bold'),
legend.key = element_blank(),
legend.key.size = unit(0.5, 'cm')
)
The plots generated by ggplot can be exported as a PDF, PNG, or other file types. (From Creating and Saving Graphs - R Base Graphs)
In RStudio, the generated plots will typically be displayed in the lower right panel. There is an Export button that allows you to save the plot as a PDF or PNG.
There are also various R functions, including jpeg(), png(), svg(), and pdf(), for exporting plots.
The steps for saving a plot:
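Open a file using a function such as pdf() or jpeg(), optionally passing height and width for specifying image dimension
Create the plot
Close the file using dev.off()
Example: Exporting plot using pdf()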
# Open the file
pdf('Rplot.pdf')
# Create the plot
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# Close the file
dev.off()
Example: Exporting plot using jpeg()
# Open the file
jpeg('Rplot.jpg', width = 350, height = 350)
# Create the plot
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# Close the file
dev.off()
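As an alternative to opening and closing a device by hand, ggplot2 also provides ggsave(), which is not covered in the steps above. The sketch below saves a plot object to a PNG file; the output path in a temporary directory and the dimensions are illustrative choices.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), 'Rplot.png')

# File type is inferred from the extension; width and height are in inches here
ggsave(filename = out_file, plot = p, width = 5, height = 4, units = 'in', dpi = 300)
```

ggsave() opens and closes the device itself, so no dev.off() call is needed. When using pdf() or jpeg() inside a function or loop instead, the plot has to be wrapped in print() because it is not printed automatically there.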